Information retrieval (IR) is the study of searching for information in documents within web pages. The user describes an information need with comments or reviews that consist of a number of words. Determining the weight of a query term helps to decide the significance of a query. Estimating term significance is a basic part of most information retrieval approaches, and it is commonly done through term frequency-inverse document frequency (TF-IDF). An improved TF-IDF method is also used to retrieve information from web documents. This paper presents a comparison of the TF-IDF method and the improved TF-IDF method for information retrieval. Cosine similarity is calculated for both methods, and the results are compared at the desired threshold value. More relevant documents are extracted by the TF-IDF method than by the improved TF-IDF method.

Keywords: Information retrieval, TF-IDF method

This is an open access article under the CC BY-SA license.
1. INTRODUCTION

Information retrieval (IR) is responsible for the storage and retrieval of large amounts of data in an efficient way. Most IR systems compute a numeric score for how well each item in the database matches the query, and rank the items according to this value. The top-ranking items are then shown to the user. The process may then be iterated if the user wishes to refine the query. The objective of IR is to find documents relevant to an information need from a large document collection. A significant shortcoming of conventional information retrieval methods is that the vocabulary that searchers use is often not the same as the one by which the information has been indexed. A large part of the existing textual information retrieval approaches depends on a lexical match between words in the user's request and words in the target objects.

In these methods, documents and queries are represented as vectors, with each component of the vector taking a value based on the presence or absence of a word in the text. To determine the relevance of a document for a given query, a similarity operation (typically a dot product) is performed on the vectors, yielding a single number. In the vector space model, features are weighted numerically. Commonly used weighting methods include Boolean weighting, frequency weighting, term frequency-inverse document frequency (TF-IDF) weighting, term frequency in class (TFC) weighting, log-weighted term frequency (LTC) weighting, and entropy weighting. TF-IDF weighting is the most commonly used among them [1, 2].
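
The presence/absence representation and the dot product described above can be illustrated with a minimal sketch. The vocabulary, document, and query below are hypothetical and chosen only for illustration; they are not taken from the experiments in this paper.

```python
def binary_vector(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


def dot_product(u, v):
    """Similarity as the sum of component-wise products."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical vocabulary, document, and query used only for illustration.
vocabulary = ["good", "great", "nice", "expensive", "satisfied"]
document = "nice bag and good price but expensive shipping"
query = "good nice satisfied"

doc_vec = binary_vector(document, vocabulary)
query_vec = binary_vector(query, vocabulary)
score = dot_product(doc_vec, query_vec)
print(doc_vec, query_vec, score)
```

With binary components the dot product simply counts the words shared by the query and the document, which is why weighting schemes such as TF-IDF are used to distinguish more informative terms.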
2. RELATED WORKS

Much of the research in information retrieval focuses on retrieving data from unstructured data sources. Intuitively, this is a much harder problem than structured data retrieval. The Boolean retrieval model assigns 1 or 0 according to the presence or absence of the terms in a document. This model performs poorly when ranking documents. Later, the vector space model was introduced for ranked retrieval. It is widely used in querying documents, clustering, classification, and other information retrieval tasks because it is simple and straightforward.

Nowadays, various trust computing methods are available. Researchers have proposed many complex trust computing methods from different perspectives. Nevertheless, the vast majority of them simply measure certain trust-related factors and combine them into a trust value by assigning a weight to each factor. Various works exist on the problem of trust estimation (some refer to it as trust evaluation or trust computing) [2-4].

We-intention, moral trust and self-motivation [5] depend on motivational factors. This was verified with Cronbach's α and squared multiple correlations (SMC). Cronbach's α estimates the proportion of the variance in the test score that is attributable to the trust score variance. The main contribution is to examine and analyze the principal factors influencing information-sharing behavior in social collaboration.

Lu S. [1] proposed an improved method, tf-idf IG, which remedies the defect of tf-idf using information gain from information theory. That paper overcomes the limitation of the traditional tf-idf: because the IDF cannot adequately reflect the discriminative power and importance of a feature, a weight adjustment strategy is put forward in which the IDF function is replaced by an evaluation function used in feature selection.

A social context-aware trust sub-network extraction model finds near-optimal solutions effectively and efficiently using ant colony algorithm (ACA) heuristic techniques [6]. The goal of that work is to extract sub-networks within a comparable execution time. An improved direct trust evaluation method based on the leader-follower clustering algorithm was proposed for a context-aware trust model [7]. It attempts to solve the data sparsity problem caused by the diversity of services and contexts.

A collaborative filtering algorithm based on matrix factorization and multi-way trust degree fusion combines the matrix decomposition technique with a social network trust model [8]. Its main contribution addresses the problem of low recommendation accuracy caused by the high conditionality of the data. Other methods predict the trust score of a source user for a target user by propagation. Various algorithms are needed to compute the most trusted path in a friend-of-a-friend network in the least time, along with weight and credit assessment [9].


3. BACKGROUND THEORY 

Two essential IR models are used in the process of retrieving data: Boolean and probabilistic. Both attempt to rank documents based on their relevance to the user's needs. Among the most popular approaches, because of their simplicity and relative effectiveness, are vector-based approaches built on a scheme that uses the frequency of words in the document and in the document collection. One of these schemes is known as tf-idf. Although tf-idf is popular, modified tf-idf schemes may further improve the effectiveness of this method [10-12].


3.1. Term frequency-inverse document frequency (TF-IDF) method 

TF-IDF is a conventional approach used to determine the significance of a term by computing its weight. The steps to compute the weights of a query and of terms in web documents using the vector space model are as follows:
- Remove the tags from the desired page.
- Remove stop-words.
- Tokenize the given sentences.
- Apply the Porter stemming algorithm.
- Calculate the term frequency (TF) of each review (query) term (q) within a query (Q) from the (web page) document.
- Calculate the inverse document frequency (IDF) of each term in the web pages (query (Q)).
- Compute the TF-IDF of each term of the query using (1) and (2).

Term frequency (TF) is a ratio indicating how often a word appears in a document. It is expressed mathematically in (1), where $n_i$ is the number of occurrences of the considered term and $\max_k n_k$ is the count of the term with the maximum number of occurrences in the web page.

$$ TF_i = \frac{n_i}{\max_k n_k} \tag{1} $$

Inverse document frequency (IDF) takes into account that many words occur frequently in many documents. IDF is expressed mathematically in (2), where $|D|$ is the total number of pages, $|\{d : t_i \in d\}|$ is the number of pages containing the term, and the logarithm is base 10.

$$ IDF_i = \log_{10} \frac{|D|}{|\{d : t_i \in d\}|} \tag{2} $$
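
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the system's implementation: the function names and the three small pages of word counts are hypothetical, and giving zero weight to a term that occurs in no page is an assumption made here to avoid division by zero.

```python
import math


def term_frequencies(counts):
    """Equation (1): divide each count by the largest count in the page."""
    largest = max(counts.values())
    if largest == 0:
        return {term: 0.0 for term in counts}
    return {term: n / largest for term, n in counts.items()}


def inverse_document_frequency(term, pages):
    """Equation (2): log10 of total pages over pages containing the term."""
    containing = sum(1 for counts in pages if counts.get(term, 0) > 0)
    if containing == 0:
        return 0.0  # assumption: a term seen in no page gets zero weight
    return math.log10(len(pages) / containing)


# Hypothetical word counts for three pages, used only for illustration.
pages = [
    {"good": 2, "nice": 1},
    {"good": 0, "nice": 4},
    {"good": 1, "nice": 0},
]

tf_first_page = term_frequencies(pages[0])
idf_good = inverse_document_frequency("good", pages)
tfidf_good = tf_first_page["good"] * idf_good
print(tf_first_page, round(idf_good, 4), round(tfidf_good, 4))
```

Here "good" occurs in two of the three pages, so its IDF is log10(3/2), and its TF in the first page is 1 because it is the most frequent term there.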
3.2. Improved TF-IDF method

TF-IDF has an obvious deficiency: if an important term appears in all the documents, the IDF value becomes small or zero [13]. For this reason, the improved TF-IDF method is used to calculate the weight of each term. The IDF is redefined to avoid a zero result, as given in (3). The improved weight formula is defined in (4), where $w_i$ is the weight of the term under the improved TF-IDF weighting strategy, $K = N - n_i$, $N$ is the total number of pages, and $n_i$ is the number of pages in which the term occurs.

$$ IDF_i = \log_{10}\left(\frac{N}{n_i} + 0.01\right) \tag{3} $$

$$ w_i = \frac{TF_i \cdot IDF_i \cdot \log_{10}\left(10 + \frac{n_i}{K}\right) \cdot N}{\sqrt{\sum_{i=1}^{n} (TF_i \cdot IDF_i)^2}} \tag{4} $$
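
As a quick arithmetic check of the improved IDF in (3) and of the logarithmic factor in (4), consider a hypothetical term that occurs in $n_i = 2$ of $N = 5$ pages, so that $K = 3$, and a second hypothetical term that occurs in all five pages. Base-10 logarithms are assumed, and the numbers are chosen for illustration only.

```latex
\begin{aligned}
IDF_i &= \log_{10}\left(\tfrac{5}{2} + 0.01\right) = \log_{10}(2.51) \approx 0.3997, \\
\log_{10}\left(10 + \tfrac{n_i}{K}\right) &= \log_{10}\left(10 + \tfrac{2}{3}\right) \approx 1.0280, \\
IDF_{\text{all pages}} &= \log_{10}\left(\tfrac{5}{5} + 0.01\right) = \log_{10}(1.01) \approx 0.0043 .
\end{aligned}
```

The last line shows the intended effect of the added constant: a term that occurs in every page receives a small positive IDF, whereas the classic IDF in (2) would give $\log_{10}(1) = 0$.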


4. PROPOSED SYSTEM DESIGN 
In this system, users can search for relevant social commerce web pages with a desired query, as shown in Figure 1. An information retrieval system retrieves information relevant to a user's needs based on the queries the user submits to the system. The steps of the proposed system design are as follows:
- To retrieve relevant web documents, a user poses a query to the system.
- The query is parsed by the system and is then used to select relevant documents from the document collection.
- Before the user's query is parsed, the web documents are stored in a database, often called the document collection, and the web contents of these documents are extracted from the database.
- After that, the system performs two stages, namely the pre-processing stage and the calculation of the weights of terms.
Figure 1. Proposed system design for comparison of improved TF-IDF method and TF-IDF method

In the pre-processing stage, when hypertext markup language (HTML) documents come into the system, each document is tokenized. Tokenizing separates the blocks in the document and removes the HTML tags. After tokenization, stop-words are removed. Stop-words include articles, pronouns, and some verbs, nouns, and adjectives. The Porter stemming algorithm is a process for removing suffixes from English words to make text processing more efficient [14, 15]. The system calculates the weights of terms in the TF-IDF method using (1) for TF, (2) for IDF, and the product of TF and IDF. It also calculates the weights of terms in the improved TF-IDF method using (1), (3) and (4) [16]. To compare the improved TF-IDF method and the TF-IDF method, the system takes the weights of terms, computes relevance using the cosine similarity method, and then selects pages according to the relevance results and the user's threshold value.
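
The pre-processing stage can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a deliberately naive suffix stripper standing in for the Porter stemming algorithm, which applies a much richer set of rules.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one used in the paper.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "and", "but", "it", "this", "i", "with"}

# Assumption: a crude suffix stripper standing in for the Porter stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_tags(html):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", html)


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip one common suffix; far simpler than real Porter stemming."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """Tags -> tokens -> stop-word removal -> stemming, in that order."""
    tokens = tokenize(strip_tags(html))
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


sample = "<p>The bags are <b>nice</b> and I was satisfied with the shipping.</p>"
print(preprocess(sample))
```

The resulting list of stems is what would be counted to obtain the term frequencies used in (1).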


5. TRUST EVALUATION 

In this system, vector space models, namely the TF-IDF method and the improved TF-IDF method, determine the relevant and irrelevant web pages using cosine similarity [17]. Let D = {OS1, OS2, ..., OSm} be a set of user reviews from online social commerce pages, where OS denotes an online social commerce page. The reviews from each page are uploaded for the pre-processing steps. The content data of five reviews are shown in the following tables.


5.1. Calculate with TF-IDF method 

TF multiplied by IDF gives TF-IDF, where TF and IDF are short for term frequency and inverse document frequency, respectively [18]. First, the number of occurrences of each word in each review is counted, as shown in Table 1. The system calculates TF and IDF using (1) and (2), as shown in Table 2. Then, the weights of terms are calculated as the product of the TF and IDF results, as shown in Table 3. Finally, the relevant pages for the user query are calculated using the cosine similarity method. The cosine measure is computed between the document vector and the Qidf vector.


$$ Qidf_i = q_i \cdot idf_i \tag{5} $$

where $q_i$ is the number of occurrences of each word in the user comments and $idf_i$ is the inverse document frequency. The textual similarity between the page and the comments is the cosine similarity between the tf*idf vector of the query and the tf*idf vector of each page [19, 20].

$$ sim(Qidf, w) = \frac{\sum_{i=1}^{n} Qidf_i \cdot w_i}{\sqrt{\sum_{i=1}^{n} Qidf_i^2} \cdot \sqrt{\sum_{i=1}^{n} w_i^2}} \tag{6} $$














Table 1. Number of word counts
Term       OS1  OS2  OS3  OS4  OS5
Good       0    1    0    0    4
Great      0    1    0    0    0
Nice       3    2    2    0    3
Expensive  0    1    1    1    0
Satisfied  0    1    1    2    1

Table 2. Result of TF and IDF
Term       OS1  OS2  OS3  OS4  OS5   IDF
Good       0    0.5  0    0    1     0.398
Great      0    0.5  0    0    0     0.6989
Nice       1    1    1    0    0.75  0.0969
Expensive  0    0.5  0.5  0.5  0     0.2219
Satisfied  0    0.5  0.5  1    0.25  0.0969








Table 3. Weight of terms by TF-IDF
Term       OS1    OS2    OS3    OS4    OS5
Good       0      0.198  0      0      0.398
Great      0      0.349  0      0      0
Nice       0.097  0.097  0.097  0      0.073
Expensive  0      0.111  0.111  0.111  0
Satisfied  0      0.048  0.048  0.097  0.024





For the cosine similarity calculation, the Qidf vector is computed from the user query using (5), as shown in Table 4. In this table, Q is the number of occurrences of each word in the user query and Qidf is the product of each word's count and its inverse document frequency value. Then, we calculate the cosine similarity using (6), as shown in Table 5. All retrieved pages are then ranked by similarity value so that the relevant pages appear at the top of the result set.
Table 4. Word counts and Qidf values
Term       Q  Qidf
Good       1  0.39794
Great      0  0
Nice       1  0.09691
Expensive  0  0
Satisfied  1  0.09691





Table 5. Result of cosine similarity method in TF-IDF
OS1       OS2      OS3       OS4       OS5
0.230256  0.51413  0.215859  0.151493  0.983509
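
The chain from word counts to the similarity values in Table 5 can be traced with a short sketch. It takes the counts from Table 1 and the query counts from Table 4, applies (1), (2), (5) and (6) with base-10 logarithms, and ranks the pages. The variable and function names are illustrative choices, not part of the described system.

```python
import math

# Columns: good, great, nice, expensive, satisfied (counts copied from Table 1).
COUNTS = {
    "OS1": [0, 0, 3, 0, 0],
    "OS2": [1, 1, 2, 1, 1],
    "OS3": [0, 0, 2, 1, 1],
    "OS4": [0, 0, 0, 1, 2],
    "OS5": [4, 0, 3, 0, 1],
}
QUERY = [1, 0, 1, 0, 1]  # query word counts copied from Table 4
NUM_TERMS = len(QUERY)


def tf(counts):
    """Equation (1): counts divided by the largest count in the page."""
    largest = max(counts)
    return [n / largest for n in counts]


def cosine(u, v):
    """Equation (6): dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Equation (2): IDF of each term over the five pages.
idf = []
for i in range(NUM_TERMS):
    containing = sum(1 for counts in COUNTS.values() if counts[i] > 0)
    idf.append(math.log10(len(COUNTS) / containing))

# Equation (5): query vector weighted by IDF.
q_idf = [q * w for q, w in zip(QUERY, idf)]

similarity = {}
for page, counts in COUNTS.items():
    weights = [t * w for t, w in zip(tf(counts), idf)]  # Table 3 values
    similarity[page] = cosine(q_idf, weights)

ranking = sorted(similarity, key=similarity.get, reverse=True)
for page in ranking:
    print(page, round(similarity[page], 6))
```

Sorting by similarity puts OS5 first, in line with the statement that the most relevant pages appear at the top of the result set.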











5.2. Improved TF-IDF method

The TF values in the improved TF-IDF method are calculated using (1) and are the same as in the TF-IDF method. The IDF values are calculated using (3), as shown in Table 6. The weights of terms are calculated using (4) from the TF and IDF values of the improved TF-IDF method. Then, the word counts (Q) for the user query are taken and the Qidf values are calculated. The cosine similarity is calculated from Q and Qidf for the user query, as shown in Table 7. This system compares the two methods for unstructured query ranking in terms of their ability to accurately disambiguate keyword queries through systematic experiments. Document 5 is at the top of the result set of relevant pages, but its cosine similarity value in the improved TF-IDF method is less than the value in the TF-IDF method.


Table 6. Results of improved IDF
Term       OS1  OS2  OS3  OS4  OS5   Improved IDF
Good       0    0.5  0    0    1     0.399
Great      0    0.5  0    0    0     0.699
Nice       1    1    1    0    0.75  0.100
Expensive  0    0.5  0.5  0.5  0     0.224
Satisfied  0    0.5  0.5  1    0.25  0.100





Table 7. Results of cosine similarity method in improved TF-IDF
OS1      OS2       OS3       OS4       OS5
0.23665  0.634956  0.258211  0.196793  0.97822
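
A rough sketch of the improved weighting applied to the Table 1 counts is given below. It assumes base-10 logarithms and, as one possible reading of (4), that the normalising sum is taken over the pages for each term; neither detail is stated explicitly in the text. Under these assumptions the sketch gives values close to the OS1, OS3, OS4 and OS5 entries of Table 7, but a noticeably lower value for OS2 than the one listed, so this reading should be treated as tentative.

```python
import math

# Columns: good, great, nice, expensive, satisfied (counts copied from Table 1).
COUNTS = {
    "OS1": [0, 0, 3, 0, 0],
    "OS2": [1, 1, 2, 1, 1],
    "OS3": [0, 0, 2, 1, 1],
    "OS4": [0, 0, 0, 1, 2],
    "OS5": [4, 0, 3, 0, 1],
}
QUERY = [1, 0, 1, 0, 1]  # query word counts copied from Table 4
N = len(COUNTS)
NUM_TERMS = len(QUERY)


def tf(counts):
    """Equation (1): counts divided by the largest count in the page."""
    largest = max(counts)
    return [c / largest for c in counts]


def cosine(u, v):
    """Equation (6): dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


tf_by_page = {page: tf(counts) for page, counts in COUNTS.items()}

# n[i]: number of pages in which term i occurs.
n = [sum(1 for c in COUNTS.values() if c[i] > 0) for i in range(NUM_TERMS)]

# Equation (3): improved IDF.
idf = [math.log10(N / n_i + 0.01) for n_i in n]


def improved_weight(page, i):
    """Equation (4), with the sum taken over pages for term i (assumption)."""
    k = N - n[i]
    if k == 0:
        raise ValueError("term occurs in every page; n_i / K is undefined")
    norm = math.sqrt(sum((tf_by_page[p][i] * idf[i]) ** 2 for p in COUNTS))
    return tf_by_page[page][i] * idf[i] * math.log10(10 + n[i] / k) * N / norm


q_idf = [q * w for q, w in zip(QUERY, idf)]
similarity = {
    page: cosine(q_idf, [improved_weight(page, i) for i in range(NUM_TERMS)])
    for page in COUNTS
}
for page, value in similarity.items():
    print(page, round(value, 5))
```

The guard for `k == 0` reflects a limitation of this reading: a term that occurs in every page would make the ratio inside the logarithm undefined, a case that does not arise in the five example reviews.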











6. EXPERIMENTAL RESULTS

In the system experiments, to give users flexible access to relevant information, the system implements both methods (improved TF-IDF and TF-IDF) on two domains: bags social commerce and shoes social commerce. The two methods are then compared using precision and recall as evaluation measures [21, 22]. Recall is the ratio of the number of relevant topics retrieved to the total number of relevant topics in the database, as in (7). Precision is the ratio of the number of relevant topics retrieved to the total number of topics retrieved, both relevant and irrelevant, as in (8).





$$ Recall = \frac{\text{No. of Relevant Documents Retrieved}}{\text{Total No. of Relevant Documents}} \tag{7} $$

$$ Precision = \frac{\text{No. of Relevant Documents Retrieved}}{\text{Total No. of Documents Retrieved}} \tag{8} $$
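
A minimal sketch of the two evaluation measures follows, checked against the "Good" row of Table 9 (31 relevant pages retrieved, 32 relevant pages in the database, 36 pages retrieved in total). The function names are illustrative, and returning 0 when a denominator is 0 is an assumption made here for robustness.

```python
def precision(relevant_retrieved, total_retrieved):
    """Share of the retrieved pages that are relevant."""
    if total_retrieved == 0:
        return 0.0  # assumption: nothing retrieved counts as zero precision
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, total_relevant):
    """Share of the relevant pages in the database that were retrieved."""
    if total_relevant == 0:
        return 0.0  # assumption: no relevant pages counts as zero recall
    return relevant_retrieved / total_relevant


# "Good" row of Table 9.
p = precision(31, 36)
r = recall(31, 32)
print(round(p, 4), round(r, 4))
```

These values agree with the precision and recall listed for that row.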


The experimental results cover information retrieval with both the improved TF-IDF and TF-IDF methods, in order to analyze performance in the social commerce domain with 50 user reviews from 100 social commerce sites. Table 8 presents the experimental results with a similarity threshold of 0.1 for search using the improved TF-IDF method [22]. When the user searches web pages using the TF-IDF method, it produces better precision and recall, as shown in Table 9.
Table 8. Experimental results of improved TF-IDF method (bags social commerce)
Term       Relevance Pages  Particular Pages in DB  Total Retrieved Pages  Precision  Recall
Good       5                32                      5                      0.7650     0.1562
Great      4                53                      4                      0.8059     0.7755
Nice       4                38                      4                      0.6890     0.6053
Expensive  11               40                      11                     0.8930     0.8275
Satisfied  6                28                      6                      0.7532     0.6143





Table 9. Experimental results of TF-IDF method (bags social commerce)
Term       Relevance Pages  Particular Pages in DB  Total Retrieved Pages  Precision  Recall
Good       31               32                      36                     0.8611     0.9688
Great      44               53                      51                     0.8627     0.8302
Nice       33               38                      37                     0.8919     0.8684
Expensive  31               40                      42                     0.7381     0.7751
Satisfied  25               28                      30                     0.8333     0.8929





Thus, in the bags social commerce domain, the experimental results show that the TF-IDF method achieves better recall than the improved TF-IDF method. In these experiments, we use 0.1 as the threshold value for both the TF-IDF method and the improved TF-IDF method. When the threshold value is higher, the recall and precision of the improved TF-IDF method are not good. Hence, the implementation uses a threshold value of 0.01 for both methods, as in Tables 10 and 11. Based on these results, the experiments in this paper use an N value of 120, and five social commerce feature words can describe the subject information of the selected web pages.
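
The effect of the similarity threshold on what is retrieved can be sketched with the cosine values from Table 5. Whether the system keeps pages whose similarity equals the threshold is not stated, so the `>=` comparison below is an assumption, and the stricter threshold of 0.5 is a hypothetical value used only to show the effect.

```python
def retrieve(similarity, threshold):
    """Keep pages whose similarity meets the threshold, best match first."""
    kept = [(page, score) for page, score in similarity.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Cosine similarity values copied from Table 5.
table5 = {
    "OS1": 0.230256,
    "OS2": 0.51413,
    "OS3": 0.215859,
    "OS4": 0.151493,
    "OS5": 0.983509,
}

loose = retrieve(table5, 0.1)   # threshold used in the bags experiments
strict = retrieve(table5, 0.5)  # hypothetical stricter threshold

print([page for page, _ in loose])
print([page for page, _ in strict])
```

Raising the threshold shrinks the retrieved set, which tends to lower recall; this is the trade-off behind the choice of threshold value discussed above.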


Table 10. Experimental results of improved TF-IDF method (shoes social commerce) (similarity threshold 0.01)
Term   Relevance Pages  Particular Pages in DB  Total Retrieved Pages  Precision  Recall
Good   38               41                      39                     0.9744     0.9268
Great  45               50                      48                     0.9375     0.9
Nice   40               43                      50                     0.8677     0.9302
Color  40               45                      43                     0.9302     0.8889
Size   37               44                      47                     0.7872     0.8409





Table 11. Experimental results of TF-IDF method (shoes social commerce) (similarity threshold 0.01)
Term   Relevance Pages  Particular Pages in DB  Total Retrieved Pages  Precision  Recall
Good   39               41                      47                     0.8298     0.9512
Great  48               50                      57                     0.8421     0.9605
Nice   40               43                      68                     0.5882     0.9302
Color  40               45                      51                     0.7843     0.8889
Size   25               47                      30                     0.8333     0.5319





The recall ratio and precision of the TF-IDF algorithm and the improved TF-IDF algorithm are shown in Figure 2 and Figure 3. Figure 2 shows the recall ratio of the improved TF-IDF algorithm compared with the classic TF-IDF algorithm, and Figure 3 shows the corresponding precision. The test results show that, under similar conditions, the improved formula is simple to implement, easy to compute and understand, and strongly explanatory, and its overall performance is better than that of the TF-IDF method [23-25].


[Figure 2: bar chart comparing improved TF-IDF and TF-IDF; axis from 0% to 100%]

Figure 2. Recall ratio


TELKOMNIKA Telecommun Comput El Control, Vol. 19, No. 3, June 2021: 809 - 816 







[Figure 3: bar chart comparing improved TF-IDF and TF-IDF; axis from 0% to 100%]

Figure 3. Precision


7. CONCLUSIONS 

With the rapid growth of data on the World Wide Web, finding and retrieving valuable data has become a significant issue. A web user may use the ranked web page list to navigate the web and find relevant pages. The aim of this system is to compute values from the weights of terms in order to predict relevant pages (social commerce web sites). The system examines relevant web pages by comparing the TF-IDF method and the improved TF-IDF method. We conclude that nearly all relevant web documents for a user query are extracted by the TF-IDF method. Likewise, TF-IDF succeeds in predicting trust for a user based on comments on shared posts, and the vector space model makes it easier than other approaches to find the most trusted path in the least time during trust computation. In future work, since users' reviews are highly significant, we will look for more appropriate techniques based on this system for trust in online communities.